Abstract



We study a model of unsupervised learning where the real-valued data vectors are isotropically distributed, except for a single symmetry-breaking binary direction $\mathbf{B} \in \{-1,+1\}^N$, onto which the projections have a Gaussian distribution. We show that a candidate vector $\mathbf{J}$ undergoing Gibbs learning in this discrete space approaches the perfect match $\mathbf{J} = \mathbf{B}$ exponentially. Besides the second order "retarded learning" phase transition for unbiased distributions, we show that first order transitions can also occur. Extending the known result that the center of mass of the Gibbs ensemble has Bayes-optimal performance, we show that taking the sign of the components of this vector (clipping) leads to the vector with optimal performance in the binary space. These upper bounds are shown generally not to be saturated with the technique of transforming the components of a special continuous vector, except in asymptotic limits and in a special linear case. Simulations are presented which are in excellent agreement with the theoretical results.



PACS numbers: 64.60Cn, 87.10Sn, 02.50-r 




1 Introduction 

Since the introduction of the Ising spin model, the study of models with discrete degrees of freedom has become a core activity in statistical mechanics. When combined with disorder, such models often have interesting connections to problems of computational complexity, to learning theory or to open problems in statistics. Discreteness and disorder introduce intrinsic difficulties, and exactly solvable models are rare. The main purpose of this paper is to present a discrete model with disorder which can be solved in full detail. The model is most naturally presented as an unsupervised learning problem, and we briefly review the connection with the existing literature.

The goal of unsupervised learning is finding structure in high-dimensional data. In one of the simplest parametric models introduced in the literature [...], $N$-dimensional independently drawn data vectors $D = \{\boldsymbol{\xi}^\mu\}$, $\mu = 1, \ldots, \alpha N$, are uniformly distributed, except for a single symmetry-breaking direction $\mathbf{B}$. If we assume that all the relevant probability distributions are known, the aim of learning is to construct an estimate vector $\mathbf{J}$ of the true direction $\mathbf{B}$.

Previous studies of this model focused on the case where $\mathbf{B}$ is constrained to have a constant size, being otherwise equiprobably sampled from the $N$-sphere. This so-called spherical case is associated with a spherical prior distribution $P_s(\mathbf{B}) \sim \delta(\mathbf{B}\cdot\mathbf{B} - N)$. The focus of the present paper, however, is on binary (or Ising) vectors. In this case, $\mathbf{B}$ is known to have binary components only, $B_j \in \{-1,+1\}$, $j = 1, \ldots, N$. This extra knowledge is taken into account by assigning a binary prior distribution $P_b(\mathbf{B})$, eq. (1), to the preferential direction.

In this framework, a Gaussian scenario was introduced in [...] as a kind of minimal model, allowing the calculations to be much simplified and the spherical case to be solved exactly. In this model, the components of $\boldsymbol{\xi}$ perpendicular to $\mathbf{B}$ are assumed to be independent Gaussian variables with zero mean and unit variance, i.e. $P(b' = \mathbf{B}'\cdot\boldsymbol{\xi}/\sqrt{N}) = \exp(-b'^2/2)/\sqrt{2\pi}$, where $\mathbf{B}'\cdot\mathbf{B}/N = 0$. The distribution of the component $b = \mathbf{B}\cdot\boldsymbol{\xi}/\sqrt{N}$ parallel to $\mathbf{B}$, on the other hand, can be chosen at will, and in the Gaussian scenario it is completely determined by the mean $B$ (a scalar, not to be confused with the vector $\mathbf{B}$) and variance $1 - A$:

$$P(b) = \frac{\mathcal{N}}{\sqrt{2\pi}} \exp\left[-\frac{b^2}{2} - U(b)\right], \tag{2}$$

where $\mathcal{N} = [\int \mathcal{D}b\, \exp(-U(b))]^{-1}$ is a normalization constant and $\mathcal{D}b = db\,\exp[-b^2/2]/\sqrt{2\pi}$.
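As a short worked check (a sketch that assumes only eq. (2) and the stated mean $B$ and variance $1 - A$; the symbols $c$ and $d$ are the coefficients used later for the quadratic potential), matching the exponent of a Gaussian to that of eq. (2) fixes $U(b)$ up to an additive constant:

```latex
\begin{align*}
-\frac{(b-B)^2}{2(1-A)} &= -\frac{b^2}{2} - U(b) + \mathrm{const} \\
\Longrightarrow\quad U(b) &= \frac{A}{2(1-A)}\,b^2 - \frac{B}{1-A}\,b
  \;=\; \frac{c}{2}\,b^2 - d\,b ,
  \qquad c = \frac{A}{1-A}, \quad d = \frac{B}{1-A} .
\end{align*}
```

In particular, $A = 0$ gives a purely linear $U(b) = -B\,b$, and $A > 0$ ($A < 0$) narrows (widens) the distribution along $\mathbf{B}$ relative to the unit variance of the perpendicular directions.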

In comparison with the spherical case, the binary case presents several extra difficulties, which motivates the study of this simple model. The main question to be addressed in this work is: given the $\alpha N$ data vectors (also called patterns) and the knowledge of the probability distributions, what is the best estimate $\mathbf{J}$ one can construct to approximate $\mathbf{B}$? The answer, cast in the framework of Bayesian inference, depends on whether $\mathbf{J}$ is allowed to have continuous components or, conversely, is required to be a binary vector. We also address the problem of whether these upper bounds can be simply attained, by first obtaining a continuous vector via minimization of a potential and then transforming its components.

The results of the replica calculation for this problem are briefly reviewed in section 2. Section 3 discusses the special case of Gibbs learning, for which simulations have been performed. In section 4 we review the reasoning leading to the Bayesian bound in the continuous as well as the binary space, with simulations compared to the theoretical results. A simple strategy which attempts to saturate these upper bounds is studied in section 5, while our conclusions are presented in section 6.

2 Unsupervised learning 

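In order to obtain a good candidate vector $\mathbf{J}$, we construct a cost function of the form $H = \sum_\mu V(\lambda^\mu)$, where $\lambda^\mu = \mathbf{J}\cdot\boldsymbol{\xi}^\mu/\sqrt{N}$. In the Gaussian scenario, the potential $V$ has a quadratic form,

$$V(\lambda) = \frac{c}{2}\lambda^2 - d\lambda. \tag{4}$$

Learning is defined as sampling $\mathbf{J}$ from the Boltzmann distribution with temperature $T = 1/\beta$,

$$P(\mathbf{J}|D) = \frac{P(\mathbf{J})}{Z(D)} \exp\left[-\beta\sum_\mu V(\lambda^\mu)\right], \tag{5}$$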

where $Z(D) = \int d\mathbf{J}\,P(\mathbf{J})\exp(-\beta H)$ is the normalization constant and the measure $P(\mathbf{J})$ is used to enforce either a binary ($P(\mathbf{J}) = P_b(\mathbf{J})$) or a spherical ($P(\mathbf{J}) = P_s(\mathbf{J})$) constraint on $\mathbf{J}$. While the spherical case has been dealt with in [...], we focus now on the case where the candidate vectors have binary components. The thermodynamic properties of such a system can be read from the free energy $f = -(1/\beta N)\ln Z$. In the thermodynamic limit $N \to \infty$, $f$ becomes self-averaging, $\langle f(D)\rangle_D = f$, and can be calculated via the replica trick. This by now standard calculation will not be reproduced here; only the results are quoted. For a replica symmetric ansatz, the quadratic forms of the Gaussian scenario allow the calculations to be performed exactly, and the free energy, eq. (6), is obtained as an extremum over four order parameters $R$, $q$, $\hat{R}$ and $\hat{q}$.

These order parameters are also self-averaging quantities and can be interpreted as follows: $R = \mathbf{B}\cdot\mathbf{J}/N$ measures the alignment between a typical (binary) sample of eq. (5) and the preferential direction, its absolute value as a function of $\alpha$ being hereafter used to account for the performance of a given potential $V$; $q = \mathbf{J}\cdot\mathbf{J}'/N$ is the mutual overlap between two different samples of eq. (5), while $\hat{R}$ and $\hat{q}$ are the associated conjugate parameters. The equilibrium values of the four variables are determined by the solution of the saddle point equations which arise from the extremum operator in eq. (6).

3 Gibbs learning 

Gibbs learning arises as a particular but very important case in this general framework. In order to 
define it properly, we first recall the Bayes inversion formula

$$p(\mathbf{B}|D) = \frac{P(D|\mathbf{B})\,P_b(\mathbf{B})}{P(D)}. \tag{7}$$



The posterior distribution $p(\mathbf{B}|D)$ expresses the knowledge about $\mathbf{B}$ which is gained after the presentation of the data. Replacing $\mathbf{B}$ with $\mathbf{J}$ in this formula gives the probability density that $\mathbf{J}$ is the "true" direction $\mathbf{B}$, given the data vectors. Note that the binary prior in eq. (1) constrains the acceptable candidates $\mathbf{J}$ to the corners of the $N$-hypercube, i.e. $\mathbf{J} \in \{-1,+1\}^N$. Making use of eq. (2), one rewrites

$$p(\mathbf{J}|D) \propto P_b(\mathbf{J})\prod_\mu \exp\left[-U\!\left(\frac{\mathbf{J}\cdot\boldsymbol{\xi}^\mu}{\sqrt{N}}\right)\right], \tag{8}$$

where the normalization constant has been omitted. Gibbs learning is defined as sampling from the distribution (8).

A comparison with eq. (5) shows that the thermodynamics of such a process is obtained by setting $\beta V = U$. Upon substitution of $\beta = 1$, $c = A/(1-A)$ and $d = B/(1-A)$ in eq. (6), one finds that the extremum of the corresponding free energy is reached for $q_G = R_G$ and $\hat{q}_G = \hat{R}_G$, where the subscript $G$ will hereafter be used to denote results from Gibbs learning. The equalities reflect the symmetric role played by $\mathbf{J}$ and $\mathbf{B}$ in Gibbs learning, a property which has been previously noted in several publications (see e.g. [12] and [...], among others). The four original saddle point equations are then effectively reduced to a single one:

$$R_G = F_B^2\left(\mathcal{F}(\sqrt{R_G})\right), \tag{9}$$

where

$$F_B(x) = \left[\int \mathcal{D}z\,\tanh(zx + x^2)\right]^{1/2} \tag{10}$$

is a function coming from the entropic term of the free energy, while

$$\mathcal{F}(\sqrt{R_G}) = \left[\alpha\,\frac{B^2 + A R_G (A - B^2)}{1 - A R_G}\right]^{1/2}. \tag{11}$$

The solution of eq. (9) also determines the value of the conjugate parameter $\hat{R}_G$:

$$\hat{R}_G = \mathcal{F}^2(\sqrt{R_G}). \tag{12}$$

In order to check that the replica symmetric ansatz is correct, we also study the entropy $s = \beta^2 \frac{\partial}{\partial\beta}\left[f - (\ln 2)/\beta\right]$, which for Gibbs learning reads

$$s_G = -\frac{(1+R_G)\hat{R}_G}{2} + \int\mathcal{D}z\,\ln 2\cosh\left(z\sqrt{\hat{R}_G} + \hat{R}_G\right) - \frac{\alpha}{2}\left\{\ln\left[\frac{1 - A R_G}{1 - A}\right] + (B^2 - A)(1 - R_G)\right\}. \tag{13}$$

On physical grounds, this quantity should always remain positive. Additionally, by relating $s_G$ with the mutual information $i$ per degree of freedom between the data $D$ and the preferential direction $\mathbf{B}$, Herschkowitz and Nadal [...] show that it cannot decrease faster than linearly with $\alpha$. For the Gaussian scenario, the inequality reads $s_G \ge \ln 2 - (\alpha/2)[B^2 - A - \ln(1 - A)]$.

Before we proceed to study in detail the solution of eq. (9), we turn to the analysis of the asymptotic behavior of the system.

3.1 Asymptotics 

The asymptotics of the solution of eq. (9) can be immediately inferred by carrying out expansions of $F_B$ and $\mathcal{F}$. In the vicinity of $R_G \simeq 0$, if we assume a smooth behavior for $R_G(\alpha)$, the predictions for the Gaussian scenario are:

$$B \neq 0 \;\Rightarrow\; R_G \simeq \alpha B^2, \tag{14}$$

$$B = 0 \;\Rightarrow\; R_G \simeq \begin{cases} 0, & \alpha < \alpha_G, \\ C_G(\alpha - \alpha_G), & \alpha > \alpha_G, \end{cases} \tag{15}$$

where

$$\alpha_G = \frac{1}{A^2}, \qquad C_G = \frac{A^2}{1 - A}. \tag{16}$$

We see that in the so-called biased case $B \neq 0$, it is much easier to learn. The unbiased case $B = 0$ makes it much more difficult to extract information about the vector $\mathbf{B}$, due to the intrinsic symmetry $\mathbf{B} \to -\mathbf{B}$. In this case, retarded learning occurs [...], meaning that a non-zero macroscopic overlap $R_G$ will be obtained only after a critical number of examples $\alpha_G N$ is presented. For $\alpha < \alpha_G$, the entropy saturates its linear bound exactly [...]. This second order phase transition is identical to the one obtained in the spherical case [...], revealing that the binary nature of the preferential direction plays no role in the poor performance regime.

In the limit $\alpha \to \infty$, on the other hand, the differences with respect to the spherical case become pronounced: $R_G$ approaches 1 exponentially,

$$1 - R_G \simeq \sqrt{\frac{\pi(1-A)}{2\alpha[B^2(1-A) + A^2]}}\,\exp\left\{-\frac{\alpha[B^2(1-A)+A^2]}{2(1-A)}\right\}, \tag{17}$$

as opposed to the power law observed for the spherical case. Eq. (17) also implies an exponential decay of the entropy, $s_G \simeq \alpha(1-R_G)(B^2(1-A)+A^2)/(2(1-A))$.

These qualitative asymptotic results can be shown to hold for general distributions $P(b)$ [13]. In the following, we explore the Gaussian scenario in more detail, studying the behavior of $R_G(\alpha)$ away from the asymptotic regimes.
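Away from the asymptotic regimes the saddle point equation has to be solved numerically. The following is a minimal sketch, not the authors' procedure: it assumes the forms $R_G = \int\mathcal{D}z\,\tanh(z\sqrt{\hat{R}_G} + \hat{R}_G)$ and $\hat{R}_G = \alpha[B^2 + A R_G(A - B^2)]/(1 - A R_G)$ for eqs. (9)-(12), and the function names `fb_squared`, `r_hat` and `solve_rg` are hypothetical. A plain fixed-point iteration only locates one solution; where several solutions coexist (first order transitions), selecting the thermodynamically stable one would additionally require comparing free energies.

```python
import numpy as np

# Fixed grid for averages over a standard normal variable z (the measure Dz).
_Z = np.linspace(-10.0, 10.0, 4001)
_W = np.exp(-0.5 * _Z**2)
_W /= _W.sum()

def fb_squared(x):
    """F_B(x)^2 = int Dz tanh(z*x + x^2) (assumed form of the entropic function)."""
    return float(np.sum(_W * np.tanh(_Z * x + x * x)))

def r_hat(r, alpha, A, B):
    """Conjugate parameter R_hat_G as a function of R_G (assumed form)."""
    return alpha * (B**2 + A * r * (A - B**2)) / (1.0 - A * r)

def solve_rg(alpha, A, B, r0=0.5, n_iter=5000, tol=1e-12):
    """Iterate R <- F_B^2(sqrt(R_hat(R))) from r0 until the update is below tol."""
    r = r0
    for _ in range(n_iter):
        r_new = fb_squared(np.sqrt(max(r_hat(r, alpha, A, B), 0.0)))
        if abs(r_new - r) < tol:
            return r_new
        r = r_new
    return r

if __name__ == "__main__":
    # Unbiased case with A = 0.6: the critical value is 1/A^2, about 2.78.
    for alpha in (2.0, 4.0):
        print(alpha, solve_rg(alpha, A=0.6, B=0.0))
```

Under these assumptions the iteration collapses to $R_G = 0$ below $\alpha_G = A^{-2}$ and settles on a non-zero overlap above it, which is the retarded learning behavior described in the text.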



3.2 The biased case 

The first case to be addressed is $A = 0$ with $B \neq 0$. The non-zero bias makes sure learning starts off as soon as $\alpha > 0$, while $A = 0$ eliminates the dependence of $\hat{R}_G$ on $R_G$ (see eqs. (11) and (12)), much simplifying the saddle point equations, which can be solved exactly. The behavior of $R_G$ is seen to be simply determined by the rescaled variable

$$\alpha' = \alpha B^2, \tag{18}$$

namely $R_G = F_B^2(\sqrt{\alpha'})$. This function can be seen in fig. 1. It shows a linear increase for small $\alpha'$ and an exponential approach to 1 for $\alpha' \to \infty$. The entropy saturates its linear bound only in the limit $\alpha' \to 0$, approaching zero exponentially when $\alpha' \to \infty$ but remaining otherwise strictly positive.

Figure 1: Overlap $R_G$ (left axis) as a function of $\alpha' = \alpha B^2$ for $A = 0$ (see eq. (18)): theory (solid line) and simulations with $N = 100$ (symbols; error bars represent one standard deviation, see text for details). The dashed line represents the entropy (right axis) while the dotted line shows the linear bound (right axis).



Note that $A = 0$ means that $b$ has unit variance. The patterns can thus be pictured as being distributed in an $N$-dimensional spherically symmetric cloud, whose displacement $B$ from the origin conveys the information about $\mathbf{B}$.



Simulations. Binary disordered systems are known to be very hard to simulate due to the existence of very many local minima. A noisy dynamics with unit temperature and a general cost function $U$ will typically get stuck in one of these minima, preventing a proper sampling of the posterior distribution (8) in an acceptable time. The Gaussian scenario with $A = 0$ provides an exception to this rule, allowing Gibbs learning to be very easily implemented with a simple Metropolis algorithm [13]. Since $A = 0$ implies a linear function $U(\lambda)$, the changes in energy can be calculated very quickly, because the energy depends only on $\mathbf{J}\cdot\sum_\mu\boldsymbol{\xi}^\mu$.
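The linear case lends itself to a compact illustration. The sketch below is not the code used for the figures: it assumes $U(b) = -B\,b$ for $A = 0$, generates patterns as an isotropic Gaussian cloud shifted by `b_mean` along $\mathbf{B}/\sqrt{N}$, and runs a Metropolis dynamics at unit temperature. The helper names `gibbs_overlap_linear` and `rg_theory`, as well as all default parameters, are hypothetical. Because the energy is a sum over sites when $A = 0$, every spin of a sweep is proposed and accepted independently here, which is a simplification specific to this case.

```python
import numpy as np

def gibbs_overlap_linear(N=1000, alpha=1.0, b_mean=1.0, sweeps=60, burn_in=20, seed=0):
    """Metropolis sampling of the posterior for A = 0; returns the mean overlap R_G."""
    rng = np.random.default_rng(seed)
    B = rng.choice([-1.0, 1.0], size=N)              # hidden binary direction
    P = int(alpha * N)                               # number of patterns
    # Unit-variance isotropic cloud displaced along B, so that
    # b = B.xi / sqrt(N) has mean b_mean and variance 1 (A = 0).
    xi = rng.standard_normal((P, N)) + b_mean * B / np.sqrt(N)
    # With U(b) = -b_mean * b, the log-posterior is J . h up to a constant.
    h = b_mean * xi.sum(axis=0) / np.sqrt(N)
    J = rng.choice([-1.0, 1.0], size=N)              # random initialization
    overlaps = []
    for t in range(sweeps):
        dE = 2.0 * J * h                             # energy cost of flipping each spin
        accept = rng.random(N) < np.exp(-np.clip(dE, 0.0, None))
        J = np.where(accept, -J, J)
        if t >= burn_in:
            overlaps.append(B @ J / N)
    return float(np.mean(overlaps))

def rg_theory(alpha_prime):
    """R_G = int Dz tanh(z*sqrt(a') + a'), evaluated on a fine grid."""
    z = np.linspace(-10.0, 10.0, 4001)
    w = np.exp(-0.5 * z**2)
    x = np.sqrt(alpha_prime)
    return float(np.sum(w * np.tanh(z * x + x * x)) / w.sum())

if __name__ == "__main__":
    print("simulation:", gibbs_overlap_linear())
    print("theory:    ", rg_theory(1.0))
```

For finite $N$ a single pattern set fluctuates around the theoretical value, so a comparison is only meaningful within a tolerance of a few times $1/\sqrt{N}$, or after averaging over pattern sets.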

Fig. 1 shows the results for simulations with $N = 100$ (the smallest system size simulated) and two values of $B$, checking the relevance of the variable $\alpha'$. For each pattern set $D$, 10 samples of $R_G$ and $q_G$ were measured, after a random initialization of the system and a warming up of the dynamics (see further details below). The whole procedure was repeated for 1000 pattern sets and the standard deviation was calculated over these 10000 samples.

Figure 2: Gaussian scenario with $A = 0$, $B = 1$ and $\alpha = 1$. Histograms of $q_G$ (thick lines) and $R_G$ (thin lines) for $N = 100$ (dashed) and $N = 1000$ (solid); the vertical line is the theoretical prediction. The upper inset shows the Metropolis $R_G$-dynamics for two pattern sets (same legend, see text for details). The lower inset presents the variance of the distribution for $R_G$ (symbols) as a function of $1/N$ for $N = 100$, 500, 750 and 1000; the dotted line is a linear fit of the three leftmost points.

The measurement of $q_G$ during simulations is another tool to check both the property $q_G = R_G$ and the correctness of the RS ansatz. Fig. 2 focuses on the second simulated point of fig. 1 ($\alpha' = 1$). It shows histograms for both $R_G$ and $q_G$ (measured between pairs of consecutive samples) which are virtually indistinguishable on the scale of the figure, with a mean value in excellent agreement with the theoretical prediction. The upper inset gives a glimpse of the Metropolis dynamics: the system is initialized randomly at $t = 0$ and evolves up to $t = 50$ Monte Carlo steps per site (MCS/site), at which moment a different pattern set is drawn. The system reaches thermal equilibrium after $O(10)$ MCS/site, which motivated the choice of safely waiting 100 MCS/site during the simulations before any measurement was made. The system was reinitialized after every measurement of the overlaps. Note that some pattern sets yield time-averaged values of $R_G$ which deviate from theory (notably the first one for $N = 100$ and the second one for $N = 1000$) and only a second average over the pattern sets gives the right results. This reflects the property of self-averaging, which only holds in the thermodynamic limit (note that deviations from theory are smaller for larger $N$). The lower inset shows the typical scaling with $1/\sqrt{N}$ of the width of the distribution of overlaps.

3.3 The unbiased case 

When $B = 0$, retarded learning is expected to occur, according to eq. (15). Fig. 3 shows the solution of the $R_G$ saddle point equation for two values of $A$, namely 0.6 and $-0.6$. In both cases, a second order phase transition occurs at the critical value $\alpha_G$ predicted by eq. (16) and the entropy saturates exactly the linear bound before the phase transition. Based on the relation between $s_G$ and $i$ [...], the retarded learning phase transition can be interpreted as follows: for $\alpha < \alpha_G$, the system extracts maximal information from each pattern but is nonetheless unable to obtain a non-zero alignment $R_G$ with the preferential direction $\mathbf{B}$. Only at $\alpha = \alpha_G$ does $R_G$ depart from zero, which in turn immediately gives an increasing degree of redundancy (measured by the deviation of $s_G$ from its linear bound) to the patterns coming thereafter. Fig. 3 also shows the effect of a small bias $B = 0.1$ in an otherwise symmetric distribution: for sufficiently large $\alpha$ (say, $\alpha \gg \alpha_G$), the effect is negligible, but for small $\alpha$ the broken symmetry destroys the second order phase transition.

Figure 3: $R_G$ (thick lines, left axis) and $s_G$ (thin lines, right axis) as functions of $\alpha$ for $A = \pm 0.6$ and $B = 0$. The thin dotted lines correspond to the linear bound, which is exactly saturated up to the second order phase transition. A small bias $B = 0.1$ (with $A = 0.6$) breaks the symmetry and destroys the second order phase transition (thick dotted line, left axis, entropy not shown).

It is interesting to note in fig. 3 that even though the phase transition for $A = -0.6$ and $A = 0.6$ occurs at the same critical value, the overlap increases much more slowly in the former case than in the latter. Recalling the definition of $A$ (the component $b$ has variance $1 - A$), this means that prolate Gaussian distributions ($N$-dimensional "cigars" [...]) convey less information about the preferential direction than oblate distributions ($N$-dimensional "pancakes" [...]) for the same absolute value of $A$.

However, the second order phase transition at $\alpha_G = A^{-2}$ is not the only interesting phenomenon for this model. First order phase transitions are also possible, depending on the value of $A$. They can occur in two situations: either for $\alpha > \alpha_G$, in which case two consecutive phase transitions take place during learning (a second order one followed by a first order one), or for $\alpha < \alpha_G$, in which case the asymptotic result eq. (15) is overridden. The first order phase transition appears when there is more than one solution to the saddle point equation. In such cases the solution with minimal free energy has maximal probability of occurrence, being thus the thermodynamically stable state. Such first order phase transitions have been found for the spherical case with a two-peaked distribution [...], but not with the Gaussian scenario [...], which shows that they are due to the discrete nature of the search space in this case.

An overview of this phenomenology is presented in fig. 4. It shows the three typical behaviors that occur for $B = 0$. For comparison, the case $A = 0.6$ plotted in fig. 3 is shown again, as an example of a parameter region where there is only a second order phase transition (at $\alpha_G = 2.78$). For $A = 0.78$ the second order phase transition at $\alpha_G = 1.643$ is followed by a first order phase transition at $\alpha_G^{(f)} = 1.704$ (upper inset, lower axis), while for $A = 0.85$ only a first order phase transition takes place at $\alpha = 1.27$, overriding the second order phase transition at $\alpha = 1.38$ (lower inset) which was predicted on asymptotics and smoothness grounds. Note that none of these first order phase transitions can be predicted by the asymptotic expansion, eq. (15). It is also interesting to observe that some solutions of the saddle point equation may violate the linear bound and/or the positivity of the entropy (notably $A = 0.85$ in fig. 4). However, it turns out that these branches are always thermodynamically unstable, while the stable solutions satisfy all the requirements.

Figure 4: Solutions $R_G$ of the saddle point equations (9) and (12) as a function of $\alpha$ for $B = 0$ and three values of $A$. $s_G$ is plotted with dashed lines and the thermodynamically stable solutions are plotted with thick lines. $A = 0.78$ (upper inset): $R_G$ (left axis) vs. $\alpha$ (bottom axis) and $s_G$ (right axis) vs. $\alpha$ (top axis; note the different $\alpha$ scale, which zooms in on the first order phase transition); $A = 0.85$ (lower inset): $R_G$ (left axis) and $s_G$ (right axis) vs. $\alpha$.

The whole phase diagram for $B = 0$ is shown in fig. 5. For $A > A_1 \simeq 0.773$, a first order phase transition takes place at the line $\alpha_G^{(f)}(A)$, after the second order one has already occurred. For increasing $A$, $\alpha_G^{(f)}(A)$ gets closer and closer to $\alpha_G(A)$, until there is finally a collapse at $A = A_2 \simeq 0.808$. For larger values of $A$, only the first order phase transition occurs.



4 Optimal learning: the Bayesian perspective 

We now switch to the following question: given the $\alpha N$ data vectors and the prior information about $\mathbf{B}$, what is the best performance $R$ one could possibly attain with a vector $\mathbf{J}$? Watkin and Nadal [...] answered this question in a Bayesian framework by defining optimal learning (see also [...]). We briefly review their reasoning here and extend it to take into account the binary nature of the vectors.

We define the quality measure $Q(\mathbf{B}, \mathbf{J}) = \mathbf{B}\cdot\mathbf{J}/N$, which quantifies how well $\mathbf{B}$ is approximated by any candidate vector $\mathbf{J}$ satisfying $\mathbf{J}\cdot\mathbf{J} = N$. Since $\mathbf{B}$ is unknown, $Q$ is formally inaccessible. But one can take its average with respect to the posterior distribution (7), leading to $\tilde{Q}(\mathbf{J}, D) = \int d\mathbf{B}\,Q(\mathbf{B}, \mathbf{J})\,p(\mathbf{B}|D)$. $\tilde{Q}$ is then a formally accessible bona fide quantity which can be used to measure the performance of $\mathbf{J}$.

Optimal learning is defined as constructing a vector $\mathbf{J}_B$ which maximizes $\tilde{Q}$. The linearity of $\tilde{Q}$ in $\mathbf{J}$ immediately implies

$$\tilde{Q}(\mathbf{J}, D) = N^{-1}\,\mathbf{J}\cdot\int d\mathbf{B}\,\mathbf{B}\,p(\mathbf{B}|D), \tag{19}$$

leading in turn to

$$\mathbf{J}_B = \frac{1}{\sqrt{R_G}}\int d\mathbf{B}\,\mathbf{B}\,p(\mathbf{B}|D), \tag{20}$$

where the $1/\sqrt{R_G}$ factor guarantees the proper normalization of $\mathbf{J}_B$. This is the so-called Bayesian vector, which is the center of mass of the Gibbs ensemble. In the thermodynamic limit, its performance $R_B = \mathbf{B}\cdot\mathbf{J}_B/N$ is shown to be simply related to that of Gibbs learning [...]: $R_B = \sqrt{R_G}$.

Figure 5: Phase diagram for the unbiased case ($B = 0$). Second order phase transitions occur at $\alpha = \alpha_G = A^{-2}$ (solid line), while first order phase transitions take place at $\alpha = \alpha_G^{(f)}$ (dashed line). See text for details.

4.1 The best binary 

Up to now the reasoning is fairly general. The whole procedure can actually be carried out without explicitly mentioning what the prior distribution $P(\mathbf{B})$ is. For clarity, in the following $\mathbf{J}_B$ will specifically denote the Bayesian vector for a binary prior. Note, however, that despite being the center of mass of the ensemble of binary vectors sampled from the posterior distribution, $\mathbf{J}_B$ has real-valued components [...], in general.

One would therefore like to address the next question: what is the best binary vector one can construct? In other words, what is the binary vector, inferable from the data, that outperforms on average any other binary vector in approximating $\mathbf{B}$? The answer is again straightforward [16]: the vector $\mathbf{J}_{bb}$ which maximizes the posterior-averaged quality measure $\tilde{Q}$ among the binary vectors is simply obtained by the clipping prescription, namely $[\mathbf{J}_{bb}]_j = \mathrm{sign}([\mathbf{J}_B]_j)$, $j = 1, \ldots, N$ or, in shorthand notation,

$$\mathbf{J}_{bb} = \mathrm{clip}(\mathbf{J}_B). \tag{21}$$

This can be easily checked by noting that the quantity to be maximized (the r.h.s. of eq. (19)) is proportional to $\sum_{j=1}^{N} J_j[\mathbf{J}_B]_j$, which for $J_j = \pm 1$ is largest when every term is positive. In what follows, $\mathbf{J}_{bb}$ is called the best binary vector.

Summarizing, if $\mathbf{B}$ is known to be binary, $\mathbf{J}_B$ is the best estimator one can provide. But if the estimator is required to be binary as well, then $\mathbf{J}_{bb}$ is the optimal choice.
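The clipping argument can be checked by brute force on a small example. The sketch below is only an illustration: the random vector `v` stands in for the center of mass $\mathbf{J}_B$, and the helper names `clip` and `best_binary_bruteforce` are hypothetical. It enumerates all corners of a small hypercube and confirms that the corner maximizing $\mathbf{J}\cdot\mathbf{v}$ is the componentwise sign of $\mathbf{v}$.

```python
import itertools
import numpy as np

def clip(v):
    """Componentwise sign, mapping zeros to +1 so the result stays binary."""
    return np.where(v >= 0, 1.0, -1.0)

def best_binary_bruteforce(v):
    """Exhaustive search for the binary vector J maximizing J . v."""
    best, best_val = None, -np.inf
    for corner in itertools.product([-1.0, 1.0], repeat=len(v)):
        val = float(np.dot(corner, v))
        if val > best_val:
            best, best_val = np.array(corner), val
    return best, best_val

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    v = rng.standard_normal(10)       # stand-in for a real-valued center of mass
    J, val = best_binary_bruteforce(v)
    print(np.array_equal(J, clip(v)), val, np.abs(v).sum())
```

The maximal value equals $\sum_j |v_j|$, which is the quantity that reappears below as the overlap between the Bayesian vector and its clipped counterpart.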

The proof that $\mathbf{J}_B$ and $\mathbf{J}_{bb}$ are optimal estimators in their respective spaces is relatively simple [...]. What we show below is that maximal $\tilde{Q}$ (the posterior average of $Q$) implies maximal alignment $R$ with $\mathbf{B}$, in the thermodynamic limit. For the best binary, one departs from the inequality $\tilde{Q}(\mathbf{J}_{bb}) - \tilde{Q}(\mathbf{J}) \ge 0$, $\forall\,\mathbf{J} \in \{-1,+1\}^N$, and takes the average with respect to the data distribution:

$$\int dD\,P(D)\left[\tilde{Q}(\mathbf{J}_{bb}) - \tilde{Q}(\mathbf{J})\right] = \int d\mathbf{B}\,P_b(\mathbf{B})\int dD\,P(D|\mathbf{B})\left[\frac{\mathbf{B}\cdot\mathbf{J}_{bb}}{N} - \frac{\mathbf{B}\cdot\mathbf{J}}{N}\right] \ge 0. \tag{22}$$

If we now assume that $R_{bb} = \mathbf{B}\cdot\mathbf{J}_{bb}/N$ and $R = \mathbf{B}\cdot\mathbf{J}/N$ are self-averaging, the average over the data can be bypassed and one obtains

$$\int d\mathbf{B}\,P_b(\mathbf{B})\,[R_{bb} - R] \ge 0. \tag{23}$$

Finally, one notices that $P_b(\mathbf{B})$ is a uniform prior, making no distinction between any particular binary vector (this is reflected, for instance, in the free energy (6) being independent of the particular choice of $\mathbf{B}$): this allows the last average in eq. (23) to be bypassed as well, leading to the stronger upper bound $R_{bb} \ge R$.

4.2 Performance and Simulations 

The performance $R_{bb}$ of the best binary vector can be explicitly calculated by extending previously obtained results for the clipping prescription [...]. In [13], Schietse et al. study the effect of a general transformation $\tilde{J}_j = \sqrt{N}\,\phi(J_j)/\sqrt{\sum_i \phi^2(J_i)}$ on the components of a properly normalized continuous vector $\mathbf{J}$ satisfying $\mathbf{B}\cdot\mathbf{J}/N = R$. If $\mathbf{B}$ is binary, $\phi$ is odd and $\tilde{R} = \mathbf{B}\cdot\tilde{\mathbf{J}}/N$ is self-averaging, then the following relation follows:

$$\tilde{R} = \frac{\int P(x)\,\phi(x)\,dx}{\left[\int P(x)\,\phi^2(x)\,dx\right]^{1/2}}, \tag{24}$$

where the variable $x = B_i J_i$ is expected to be distributed independently of the index $i$, because of the permutation symmetry among the axes.

We are left then with the problem of calculating $P(x)$, after which eq. (24) can be applied for $\phi(x) = \mathrm{sign}(x)$, providing $R_{bb}$ as a function of $R_G$. If $\mathbf{J}$ is uniformly distributed on the cone $\mathbf{B}\cdot\mathbf{J} = NR$, then $P(x)$ is just a Gaussian with mean $R$ and variance $1 - R^2$ [...]. However, this can hardly be expected to hold for the Bayesian vector, since it is a sum of Ising vectors. One would naively expect $\mathbf{J}_B$ to be closer to the corners of the $N$-hypercube instead. To obtain the relevant $P_{cm}(x) = P(x = B_i[\mathbf{J}_B]_i)$, we calculate the $m$-th quenched moment of $[\mathbf{J}_B]_i$,

$$\left\langle([\mathbf{J}_B]_i)^m\right\rangle_D = R_G^{-m/2}\left\langle\left[Z^{-1}\int d\mathbf{J}\,P_b(\mathbf{J})\,e^{-\sum_\mu U(\lambda^\mu)}\,J_i\right]^m\right\rangle_D. \tag{25}$$



The above expression can be evaluated again with the use of the replica trick. The replica symmetric result is

$$\left\langle([\mathbf{J}_B]_i)^m\right\rangle_D = [R_G(\alpha)]^{-m/2}\int\mathcal{D}z\,\tanh^m\left(z\sqrt{\hat{R}_G(\alpha)} + \hat{R}_G(\alpha)B_i\right) \tag{26}$$

or, equivalently,

$$\left\langle(B_i[\mathbf{J}_B]_i)^m\right\rangle_D = [R_G(\alpha)]^{-m/2}\int\mathcal{D}z\,\tanh^m\left(z\sqrt{\hat{R}_G(\alpha)} + \hat{R}_G(\alpha)\right), \tag{27}$$



where the values of $R_G(\alpha)$ and $\hat{R}_G(\alpha)$ should be taken at the solution of the saddle point equations (9) and (12) for Gibbs learning. Notice that the passage from eq. (26) to eq. (27) is valid only if $\mathbf{B}$ is binary. From eq. (27) one immediately rewrites the probability distribution $P_{cm}(x)$, by identifying a change of stochastic variables

$$x = R_G^{-1/2}\tanh\left(z\sqrt{\hat{R}_G} + \hat{R}_G\right),$$

with $z$ normally distributed:

$$P_{cm}(x) = \frac{\sqrt{R_G}}{\sqrt{2\pi\hat{R}_G}\,(1 - R_G x^2)}\exp\left\{-\frac{1}{2\hat{R}_G}\left[\tanh^{-1}\left(\sqrt{R_G}\,x\right) - \hat{R}_G\right]^2\right\}. \tag{28}$$

Note that, since $R_G$ and $\hat{R}_G$ are simply related to each other (eqs. (9) and (12)), $P_{cm}(x)$ can always be parametrized as a function of $R_G$ only. In fig. 6, the probability distribution of $y = x\sqrt{R_G}$ is plotted for different values of $R_G$, illustrating the fact that $|y| < 1$.

Figure 6: Probability distribution of $y = x\sqrt{R_G}$ for different values of $R_G$.

Eq. (28) should be compared to the Gaussian distribution obtained in [...]. It shows that $\mathbf{J}_B$ is indeed closer to the corners of the $N$-hypercube, optimally incorporating the information that $\mathbf{B}$ is binary.

We have run simulations for $A = 0$ and $B = 1$ as described in section 3.2. For a system size $N = 500$, the center of mass was constructed with $n = 50$ samples, being normalized afterwards. Each component of $\mathbf{B}$ and $\mathbf{J}_B$ was used to measure $x$, the procedure being repeated 100 times for each of the 100 pattern sets. A comparison between the resulting histogram and the theoretical prediction can be seen in fig. 7. The good agreement shows that eq. (28) correctly describes the statistical properties of the Bayesian vector.

We can finally proceed to calculate the performance $R_{bb}$ of the best binary vector. Making use of eqs. (24) and (28), we make a change of variables to obtain

$$R_{bb} = \int P_{cm}(x)\,\mathrm{sign}(x)\,dx = 1 - 2H\left(\sqrt{\hat{R}_G}\right), \tag{29}$$

where $H(x) = \int_x^\infty \mathcal{D}z$. Making use of the relation between $\hat{R}_G$ and $R_G$, we finally write

$$R_{bb} = 1 - 2H\left(F_B^{-1}(R_B)\right) = 1 - 2H\left(F_B^{-1}(\sqrt{R_G})\right) \tag{30}$$

$$\phantom{R_{bb}} = 1 - 2H\left(\mathcal{F}(R_B)\right) = 1 - 2H\left(\mathcal{F}(\sqrt{R_G})\right), \tag{31}$$

where $F_B^{-1}$ is the inverse of $F_B$.

Figure 7: Gaussian scenario with $A = 0$ and $B = 1$. Probability distribution of $x = B_i[\mathbf{J}_B]_i$ for $\alpha = 0.25$, $\alpha = 0.6$ (left plot) and $\alpha = 1.17$ (right plot) or, correspondingly, $R_G = 0.2$, $R_G = 0.4$ and $R_G = 0.6$. Solid line: simulations. Dashed line: theory (eq. (28)).

Eq. (31) expresses an upper bound for binary candidate vectors $\mathbf{J}$ in approximating $\mathbf{B}$, satisfying two obvious inequalities: $R_G \le R_{bb} \le R_B$. The asymptotic behavior of $R_{bb}$ can always be written in terms of $R_G$: in the poor performance regime ($R_G \to 0$), one recovers previous results for clipping a spherical vector [...], $R_{bb} \simeq \sqrt{2R_G/\pi}$, while in the large $\alpha$ regime ($R_G \to 1$) a faster exponential decay is achieved than with Gibbs learning: $1 - R_{bb} \simeq (2/\pi)(1 - R_G)$ [12].
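These relations are easy to evaluate numerically. The sketch below assumes the forms $R_G = \int\mathcal{D}z\,\tanh(z\sqrt{\hat{R}_G} + \hat{R}_G)$, $R_B = \sqrt{R_G}$ and $R_{bb} = 1 - 2H(\sqrt{\hat{R}_G})$ with $H(x) = \int_x^\infty\mathcal{D}z$, all parametrized by the conjugate parameter $\hat{R}_G$; the function names are hypothetical.

```python
import math
import numpy as np

_Z = np.linspace(-12.0, 12.0, 24001)      # grid for the Gaussian measure Dz
_W = np.exp(-0.5 * _Z**2)
_W /= _W.sum()

def fb_squared(x):
    """int Dz tanh(z*x + x^2); equals R_G when x = sqrt(R_hat_G)."""
    return float(np.sum(_W * np.tanh(_Z * x + x * x)))

def H(x):
    """Gaussian tail probability H(x) = int_x^infinity Dz."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def performances(r_hat):
    """Return (R_G, R_bb, R_B) for a given conjugate parameter R_hat_G."""
    r_g = fb_squared(math.sqrt(r_hat))
    r_bb = 1.0 - 2.0 * H(math.sqrt(r_hat))
    return r_g, r_bb, math.sqrt(r_g)

if __name__ == "__main__":
    for r_hat in (0.01, 0.25, 1.0, 4.0, 9.0):
        print(r_hat, performances(r_hat))
```

For $A = 0$ the conjugate parameter is simply $\hat{R}_G = \alpha B^2$, so the printed triples can be read directly as functions of the rescaled variable $\alpha'$.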

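As a spin-off of the calculation, we have also obtained the overlap between the Bayesian vector and its clipped counterpart, $\Gamma = \mathbf{J}_B\cdot\mathbf{J}_{bb}/N = N^{-1}\sum_j |[\mathbf{J}_B]_j|$. This quantity can be easily computed,

$$\Gamma = \frac{1}{\sqrt{R_G}}\int\mathcal{D}z\,\left|\tanh\left(z\sqrt{\hat{R}_G(\alpha)} + \hat{R}_G(\alpha)\right)\right| = \frac{R_{bb}}{R_B}, \tag{32}$$

if one notices the counterintuitive identity $\int\mathcal{D}z\,|\tanh(za + a^2)| = \int\mathcal{D}z\,\mathrm{sign}(za + a^2)$, $\forall a$, which is proved in the appendix. The simple result $\Gamma = R_{bb}/R_B$ immediately implies the equality $(\mathbf{J}_{bb} - \Gamma\mathbf{J}_B)\cdot(\mathbf{J}_B - R_B\mathbf{B}) = 0$, for which we still have not found a deeper interpretation.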
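The identity between the two Gaussian averages can be verified numerically before turning to the proof. This is only a sanity check, with hypothetical helper names: the averages are approximated by a Riemann sum on a fine grid, and the right-hand side is also compared with its closed form $1 - 2H(a) = \mathrm{erf}(a/\sqrt{2})$ for $a > 0$.

```python
import math
import numpy as np

def gaussian_average(f, z_max=12.0, n=480001):
    """Average of f(z) over a standard normal z, via a Riemann sum on a fine grid."""
    z = np.linspace(-z_max, z_max, n)
    w = np.exp(-0.5 * z * z)
    return float(np.sum(w * f(z)) / np.sum(w))

def lhs(a):
    """int Dz |tanh(z*a + a^2)|"""
    return gaussian_average(lambda z: np.abs(np.tanh(z * a + a * a)))

def rhs(a):
    """int Dz sign(z*a + a^2)"""
    return gaussian_average(lambda z: np.sign(z * a + a * a))

if __name__ == "__main__":
    for a in (0.3, 1.0, 2.5):
        print(a, lhs(a), rhs(a), math.erf(a / math.sqrt(2.0)))
```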

The curves $R_{bb}$, $R_B$ and $\Gamma$ as functions of $R_G$ are plotted in fig. 8, together with results for simulations with the same parameters as those of fig. 7. The data is in excellent agreement with the theoretical results, errors generally remaining below the margin of one standard deviation. Note that $\Gamma \ge \sqrt{2/\pi}$, with equality holding only for $R_G \to 0$. This result is another confirmation of the picture that $\mathbf{J}_B$ lies closer to the binary vectors, since $\sqrt{2/\pi}$ is the overlap between a continuous vector isotropically sampled from the $N$-hypersphere and its clipped counterpart. Fig. 9 zooms in on the fourth column of points of fig. 8 ($\alpha = 1.17$), showing the histograms of the overlaps. One observes that the distribution of $\Gamma$ is much sharper than the other ones, while the statistics of $R_G$ is better because it has $n = 50$ times more samples.

Figure 8: $R_{bb}$, $R_B$ and $\Gamma$: comparison between simulations and theory, for different values of $\alpha$. Error bars represent one standard deviation. The diagonal is plotted for comparison.

Figure 9: Histograms of the overlaps $R_G$, $R_B$, $R_{bb}$ and $\Gamma$ for $\alpha = 1.17$. The vertical lines show the theoretical predictions.

5 Transforming the components 

5.1 General results 

Since sampling the binary Gibbsian vectors is usually a very difficult task, the construction of the Bayesian vector according to the center of mass recipe is not always possible in practice. Alternative methods should therefore be developed for approximating the $R_B$ and $R_{bb}$ performances. One such method is the technique of transforming the components of a previously obtained spherical vector, as described in section 4.2. In the following, we first derive general results (for any function $U$) and subsequently look at the Gaussian scenario in detail.

A natural choice for the vector to be transformed is $\mathbf{J}^s_{opt}$, which can be obtained by minimizing, in the $N$-hypersphere, an optimally constructed cost function [...] $H = \sum_\mu V_{opt}(\lambda^\mu)$. It attains the Bayes-optimal performance $R^s_{opt} = \mathbf{B}\cdot\mathbf{J}^s_{opt}/N$ for the spherical case, which satisfies

$$R^s_{opt} = F_s\left(\mathcal{F}(R^s_{opt})\right), \tag{33}$$

where $F_s(x) = x/\sqrt{1 + x^2}$. Note that $\mathbf{J}^s_{opt}$ saturates the performance of the center of mass of the Gibbs ensemble for a spherical prior. Eq. (33) should be compared to the performance of $\mathbf{J}_B$, which obeys

$$R_B = F_B\left(\mathcal{F}(R_B)\right). \tag{34}$$

While $R^s_{opt} \simeq R_B$ for small $\alpha$, the differences between $\mathbf{J}_B$ and $\mathbf{J}^s_{opt}$ are clearly manifested in the asymptotic behavior for large $\alpha$, with $R^s_{opt}$ approaching unity with a power law [...] instead of the exponential decay of eq. (17).

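We would like to depart from $\mathbf{J}^s_{\mathrm{opt}}$ and obtain approximations to both $\mathbf{J}_{bb}$ and $\mathbf{J}_B$. The first one
is obtained by clipping, $\mathbf{J}^s_{\mathrm{clip}} = \mathrm{clip}(\mathbf{J}^s_{\mathrm{opt}})$. The second one relies on an optimal transformation iQ
$\phi^*(x) = [P(x) - P(-x)]/[P(x) + P(-x)]$, which maximizes the transformed overlap (see footnote). The vector
obtained by such a transformation on $\mathbf{J}^s_{\mathrm{opt}}$ is denoted by $\mathbf{J}_*$, thus $[\mathbf{J}_*]_j = \phi^*([\mathbf{J}^s_{\mathrm{opt}}]_j)/R_*$, $j = 1, \ldots, N$,
where $R_* = \mathbf{B} \cdot \mathbf{J}_*/N$.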

Since $\mathbf{J}^s_{\mathrm{opt}}$ contains no information about the binary nature of $\mathbf{B}$, the results of Schietse et al. can
be directly applied to render $R^s_{\mathrm{clip}} = \mathbf{B} \cdot \mathbf{J}^s_{\mathrm{clip}}/N$ and $R_*$. In this case, $P(x)$ is Gaussian and one



obtains Q 



$$R^s_{\mathrm{clip}} = 1 - 2H\big(\mathcal{F}(R^s_{\mathrm{opt}})\big) \tag{35}$$

$$R_* = F_B\big(\mathcal{F}(R^s_{\mathrm{opt}})\big). \tag{36}$$
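The clipping result of eq. (35) can be illustrated at the level of single components. The sketch below assumes the gauge $B_j = +1$, so that the aligned components of the spherical vector are Gaussian with mean $R^s_{\mathrm{opt}}$ and variance $1-(R^s_{\mathrm{opt}})^2$, as stated later in this section for $P(x)$. It also assumes that $H(x)$ is the Gaussian tail integral $\int_x^\infty Dz$. The numerical value `f_value` is a hypothetical stand-in for $\mathcal{F}(R^s_{\mathrm{opt}})$, not a value from the paper.

```python
import math
import numpy as np

def H(x):
    """Gaussian tail integral, H(x) = int_x^inf Dz (assumed definition)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def F_s(x):
    """Spherical response function F_s(x) = x / sqrt(1 + x^2)."""
    return x / math.sqrt(1.0 + x * x)

def clipped_overlap_mc(r_opt, n_dim, rng):
    """Monte Carlo overlap of the clipped vector, in the gauge B_j = +1."""
    x = r_opt + math.sqrt(1.0 - r_opt**2) * rng.standard_normal(n_dim)
    return float(np.mean(np.sign(x)))

rng = np.random.default_rng(1)
f_value = 0.8                        # hypothetical value of calF(R_opt^s)
r_opt = F_s(f_value)                 # overlap solving eq. (33) for this calF
r_clip_theory = 1.0 - 2.0 * H(f_value)
r_clip_mc = clipped_overlap_mc(r_opt, 1_000_000, rng)
print(r_opt, r_clip_theory, r_clip_mc)
```

The check relies on the identity $R/\sqrt{1-R^2} = \mathcal{F}$ whenever $R = F_s(\mathcal{F})$, which is why the argument of $H$ in eq. (35) is $\mathcal{F}$ itself.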



We would like to compare eqs. (30) and (34) with eqs. (35) and (36), respectively. Despite their resemblance
in form, one notices that the former should be solved, while the latter just map the solution of
eq. (33). In order to compare the equations, one should first note that $F_B(x) > F_s(x)$, $\forall x > 0$. Since
$d\mathcal{F}/dR > 0$, in general $R_B > R^s_{\mathrm{opt}}$. This result in turn immediately implies the inequalities

$$R^s_{\mathrm{clip}} \leq R_{bb} \tag{37}$$
$$R_* \leq R_B, \tag{38}$$

Footnote: Consistently, the optimal transformation for $P(x) = P_{cm}(x)$ becomes $\phi^*(x) \propto x$, that is, no improvement is
possible.
Figure 10: Overlaps as functions of $\alpha$ for two choices of parameters in the Gaussian scenario: $A = 1/3$
with $B = 0.1$ (upper curves) and $A = -1/3$ with $B = 0$ (lower curves). The upper bounds $R_B$ (solid)
and $R_{bb}$ (dashed) are depicted with thick lines, while the approximations $R_*$ (solid) and $R^s_{\mathrm{clip}}$ (dashed)
are plotted with thin lines.



with equality holding for both equations in the asymptotic limits $\alpha \to \infty$ and $R_G \to 0$. This general
behavior is confirmed in fig. 10, which shows the results for the Gaussian scenario in the two relevant
cases: zero and non-zero bias.

Another available measure of the success of the optimal transformation $\phi^*$ in rendering a good
approximation for $\mathbf{J}_B$ is the probability distribution $P(x_*)$, where $x_* = \phi^*(x)/R_*$. In order to obtain
$P(x_*)$, one just has to recall that $P(x)$ is Gaussian with mean $R^s_{\mathrm{opt}}$ and variance $1 - (R^s_{\mathrm{opt}})^2$. The
optimal transformation is then $x_* = \phi^*(x)/R_* = \tanh\big(R^s_{\mathrm{opt}}\, x/(1 - (R^s_{\mathrm{opt}})^2)\big)/R_*$, and can be regarded
as an attempt to attach some structure to the distribution of the transformed $x_*$. With a simple
change of variables, $P(x_*)$ is readily seen to be



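$$P(x_*) = \frac{R_* \sqrt{1-(R^s_{\mathrm{opt}})^2}}{\sqrt{2\pi}\, R^s_{\mathrm{opt}} \left[1-(R_* x_*)^2\right]}
\exp\left\{ -\frac{1-(R^s_{\mathrm{opt}})^2}{2 (R^s_{\mathrm{opt}})^2}
\left[ \frac{1}{2}\ln\frac{1+R_* x_*}{1-R_* x_*} - \frac{(R^s_{\mathrm{opt}})^2}{1-(R^s_{\mathrm{opt}})^2} \right]^2 \right\}. \tag{39}$$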
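The change of variables leading to eq. (39) can be checked numerically. The sketch below is written under stated assumptions: the value `r = 0.6` for $R^s_{\mathrm{opt}}$ is purely illustrative, the grids are arbitrary, and $R_*$ is computed from $R_*^2 = \int dx\, P(x)\,\phi^*(x)$. That last relation follows from the definitions $[\mathbf{J}_*]_j = \phi^*([\mathbf{J}^s_{\mathrm{opt}}]_j)/R_*$ and $R_* = \mathbf{B}\cdot\mathbf{J}_*/N$ in the gauge $B_j = +1$ for large $N$. The code verifies that $\phi^*$ reduces to a hyperbolic tangent for Gaussian $P(x)$, that the density of eq. (39) is normalized, and that its mean equals $R_*$.

```python
import numpy as np

def gauss(x, r):
    """P(x): Gaussian with mean r and variance 1 - r^2."""
    var = 1.0 - r**2
    return np.exp(-((x - r) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def phi_star(x, r):
    """Optimal transformation for Gaussian P(x): tanh(r x / (1 - r^2))."""
    return np.tanh(r * x / (1.0 - r**2))

def p_star(x_star, r, r_star):
    """Density of x_* = phi_star(x) / r_star, written as in eq. (39)."""
    u = r_star * x_star
    pref = r_star * np.sqrt(1.0 - r**2) / (np.sqrt(2.0 * np.pi) * r * (1.0 - u**2))
    arg = np.arctanh(u) - r**2 / (1.0 - r**2)  # arctanh(u) = (1/2) ln((1+u)/(1-u))
    return pref * np.exp(-(1.0 - r**2) / (2.0 * r**2) * arg**2)

r = 0.6  # hypothetical value of R_opt^s (assumption)
sigma = np.sqrt(1.0 - r**2)
x, dx = np.linspace(r - 10.0 * sigma, r + 10.0 * sigma, 200_001, retstep=True)

# phi* = [P(x) - P(-x)] / [P(x) + P(-x)] should coincide with the tanh form.
ratio = (gauss(x, r) - gauss(-x, r)) / (gauss(x, r) + gauss(-x, r))
max_dev = float(np.max(np.abs(ratio - phi_star(x, r))))

# R_*^2 = <phi*(x)>_P, evaluated with a simple Riemann sum.
r_star = float(np.sqrt(np.sum(gauss(x, r) * phi_star(x, r)) * dx))

# Midpoint grid on the support |x_*| < 1 / R_*.
n = 400_000
u = (np.arange(n) + 0.5) * (2.0 / n) - 1.0
x_star = u / r_star
dx_star = 2.0 / (n * r_star)
density = p_star(x_star, r, r_star)
norm = float(np.sum(density) * dx_star)
mean = float(np.sum(x_star * density) * dx_star)
print(max_dev, r_star, norm, mean)
```

The normalization should come out close to one and the mean close to $R_*$; both are consequences of the change of variables alone and do not depend on the particular value chosen for `r`.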



A comparison with eq. Eq shows that the two equations are very similar, but not identical. Some
similarity in shape should indeed be expected, mainly because $P(x_*)$, just like $P_{cm}(x)$, must be such
that $[P(x) - P(-x)]/[P(x) + P(-x)] \propto x$, in order to consistently prevent any further improvement
by a similar transformation. One can verify in fig. 11 that the resemblance between the probability
distributions is closely associated with the success of $R_*$ in saturating the upper bound $R_B$. The
curves correspond to the Gaussian scenario with $A = 1/3$ and $B = 0.1$ for two values of $\alpha$ (one can
thus refer to the upper solid curves of fig. 10). Note that for $\alpha = 8$, the difference between $R_B$ and $R_*$
is very small in fig. 10, which is reflected in the solid curves of fig. 11 being very close to each other.
Accordingly, the dashed curves in fig. 11 get further apart for $\alpha = 10$ as the mismatch between the
overlaps increases in fig. 10.
Figure 11: Distributions $P_{cm}(x)$ (thick) and $P(x_*)$ (thin) according to eq. pSJ and eq. (39), respectively.
The values $\alpha = 8$ (solid) and $\alpha = 10$ (dashed) refer to the Gaussian scenario with $A = 1/3$ and $B = 0.1$
(see fig. 10).



5.2 The biased case 

The simple biased case with $A = 0$ and $B \neq 0$ provides an interesting exception to the performances
of $\mathbf{J}_*$ and $\mathbf{J}^s_{\mathrm{clip}}$. The fact that $U(b)$ is linear implies $d\mathcal{F}/dR = 0$, as can be readily verified in eq. (11).
This, on the other hand, implies the equalities in eqs. (37) and (38), that is,



$$R^s_{\mathrm{clip}} = R_{bb} \tag{40}$$
$$R_* = R_B. \tag{41}$$



Therefore the strategy described in the previous section is successful in attaining the upper bounds
of section 4, and not only asymptotically. It should be noted that for a linear $U$, the vector $\mathbf{J}^s_{\mathrm{opt}}$
can be simply constructed with the Hebbian rule, $\mathbf{J}^s_{\mathrm{opt}} \propto \sum_\mu \boldsymbol{\xi}^\mu$, $\forall \alpha$. Therefore the best binary
performance is attainable by the clipped Hebbian vector $\mathbf{J}^s_{\mathrm{clip}}$ in this case. The second equality,
however, seems to us more remarkable, because it establishes a result which we could not find elsewhere
in the literature: the optimal transformation manages to completely incorporate the information about
the binary nature of $\mathbf{B}$, leading to the Bayes-optimal performance $R_B$ without the need of explicitly
constructing the center of mass of the Gibbs ensemble. In other words, the technique of non-linearly
transforming the components of the vectors, introduced in [ p7[ and extended in [nsl, is able to give a
definitive answer to the problem it aims to solve.



6 Conclusions 

We have presented results on learning a binary preferential direction from disordered data. Constraining
the candidate vectors to the binary space as well, we first showed that Gibbs learning presents
not only an exponential asymptotic decay and second order "retarded learning" phase transitions, but
also first order phase transitions for a simple Gaussian scenario.
On the question of what is the optimal estimator, given the data and the knowledge that the
preferential direction is binary, we have shown that the answer depends on which space the estimator 
J is allowed to lie in. The best continuous estimator is the Bayesian vector, which is the center of 
mass of binary Gibbsian vectors. The best binary vector is obtained by clipping the Bayesian vector. 
We have calculated its properties in detail, providing an upper bound to the performance of binary 
vectors. 

Finally, we have also studied one possible way of constructing approximations to these two optimal 
estimators. By transforming the components of a previously obtained continuous vector, we show that 
the upper bounds cannot be saturated, in general. Exceptions to this rule are the asymptotic limits 
(both $R_G \to 0$ and $\alpha \to \infty$) and the special case of a linear function $U$. Interestingly, the linear
case also seems to be the only one in which Gibbs sampling can be performed without computational 
difficulties. We are therefore left with a situation where the approximations work perfectly only in 
the case where they are actually not needed. We believe this kind of result reinforces the need of 
investigating the connection between results in statistical mechanics and computational complexity 
theory. 
Appendix

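In order to show that eq. |2| is correct, one just has to show that the integral below vanishes identically,
$\forall a$:

$$\begin{aligned}
\int Dz \left[ \mathrm{sign}(az + a^2) - |\tanh(az + a^2)| \right]
&= \int Dz\, \mathrm{sign}(az + a^2) \left[ 1 - \tanh(az + a^2) \right] \\
&\overset{z = y - a}{=} \int \frac{dy}{\sqrt{2\pi}}\, e^{-(y-a)^2/2}\, \mathrm{sign}(ay) \left[ 1 - \tanh(ay) \right] \\
&= e^{-a^2/2} \int Dy\, \mathrm{sign}(ay)\, e^{ay}\, \frac{e^{-ay}}{\cosh(ay)} \\
&= 0.
\end{aligned} \tag{42}$$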
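The vanishing of the integral in eq. (42) can also be confirmed numerically. The following is a minimal sketch, assuming the Gaussian measure $Dz = dz\, e^{-z^2/2}/\sqrt{2\pi}$; the integration window, grid size and test values of $a$ are illustrative choices, and the function name `appendix_integral` is hypothetical.

```python
import numpy as np

def appendix_integral(a, half_width=12.0, n=2_000_001):
    """Riemann-sum estimate of int Dz [sign(az + a^2) - |tanh(az + a^2)|]."""
    z, dz = np.linspace(-half_width, half_width, n, retstep=True)
    measure = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)  # Gaussian measure Dz
    arg = a * z + a**2
    integrand = np.sign(arg) - np.abs(np.tanh(arg))
    return float(np.sum(measure * integrand) * dz)

# Illustrative values of a, of both signs (assumption).
values = {a: appendix_integral(a) for a in (-2.0, -0.5, 0.3, 1.0, 3.0)}
for a, val in values.items():
    print(a, val)
```

Because the integrand jumps at $z = -a$, the Riemann sum is only accurate to about one grid spacing, so the printed values should be small (of order $10^{-5}$ for this grid) rather than exactly zero.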